Abstract: In this paper we propose a Bayesian, information-theoretic
approach to dimensionality reduction. The approach is formulated as a
variational principle on mutual information, and seamlessly addresses the
notions of sufficiency, relevance, and representation. Maximally informative
statistics are shown to minimize a Kullback-Leibler distance between posterior
distributions. Illustrating the approach, we derive the maximally informative
one-dimensional statistic for a random sample from the Cauchy distribution.
1 Introduction

Dimensionality reduction is a fundamental goal of statistical science. In a 
modeling context, this is often facilitated by estimating a low dimensional 
quantity of interest. For example, suppose the quantities of interest are the 
labels of a classification of photographs of objects; of trees, children, etc. 
The data are the photographs, and the goal is to infer which of the several 
classes have been presented. In this case the data space often has dimension
on the order of $> 10^6$, while the parameter space is a small discrete set of
labels, each having much lower dimension. A low dimensional summary of
the photograph is then obtained as the estimate of the classification of the 
photograph. 

In this paper, we propose a novel fully Bayesian information theoretic 
approach to dimensionality reduction, based on maximizing the mutual in- 
formation between a statistic and a quantity of interest. The approach is for- 
mulated as a variational principle on mutual information, and it seamlessly 
addresses the notions of sufficiency, relevance, and representation. We refer 
to statistics which maximize this mutual information as maximally informa- 
tive (MI) statistics. Such statistics are shown to minimize a Kullback-Leibler 
distance between posterior distributions. 

The mutual information between a statistic and a quantity of interest
is defined in section 2. The mutual information based variational principle
for MI statistics is utilized in section 3 to derive non-variational derivative
forms of the principle. In section 4 several properties of MI statistics are
derived. The important result of that section is that MI statistics provide
a generalization of the notion of sufficiency, because they are sensible both
when they are not sufficient statistics, and when lower-than-data-dimension
sufficient statistics do not exist. In section 5 we present the result that
in inference the Kullback-Leibler (KL) distance is properly a functional of
posterior distributions. There we find MI statistics at functional minima of
a KL distance based on posterior distributions of the parameter of interest.
The arguments made there suggest that the KL distance derived here is to
be preferred to a maximum relative entropy distance, a fact which is not
discussed in, for example, Kullback or Shor, and numerous others. In
section 6 the MI statistic for the location parameter of the univariate
Gaussian distribution is derived, and shown to be the expected result, since
in this case a one-dimensional sufficient statistic exists. In section 7 we
find a one-dimensional MI statistic for the Cauchy distribution, where a
sufficiency reduction does not exist. In section 8 we discuss approximating
the posterior distribution as a Gaussian and apply this technique to show
that the MI statistics are then Bayes' estimators of the mean and standard
deviation. In that section the approximate MI inference approach is
contrasted with the Maximum Entropy method, and it is shown that although
they agree for Gaussian likelihoods, they disagree for other distributions,
with simplicity arguing in favor of the MI statistics.



2 The Mutual Information Between a Statistic and a Quantity of 
Interest 

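Let the data $x \in X$ be drawn according to a parameterized distribution
$P(x \mid \theta)$, with $\theta \in \Theta$, the parameter space. $\theta$
itself is distributed according to the prior $P(\theta)$. The marginal
distribution of $x$ is obtained from
$P(x) = \int P(x \mid \theta)\, P(\theta)\, d\theta$, and the posterior of
$\theta$ given $x$ is obtained from Bayes' Theorem as

$$P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}. \tag{1}$$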

The quantity of interest $q = \xi_Q(\theta)$ will be a function of $\theta$,
a mapping from the parameter space $\Theta$ into some $Q$,
$\xi_Q(\cdot) : \Theta \to Q$. It will be useful to use the Dirac
delta-function $\delta(\cdot)$ to represent the distribution of $q$ as

\begin{align}
P(q \mid \theta) &:= P(\{\theta : q = \xi_Q(\theta)\} \mid \theta) \notag \\
  &= \delta(q - \xi_Q(\theta)) \tag{2} \\
  &= \prod_i \delta(q_i - \xi_{Q,i}(\theta)) \tag{3}
\end{align}

where $\delta(z(\cdot)) = \prod_i \delta(z_i(\cdot))$. Note that (2) may be
seen directly by using Bayes' theorem to expand $P(q, \theta)$ as
$P(q \mid \theta)\, P(\theta)$, integrating that over $q$, which must produce
$P(\theta)$, and noting that because the support of $P(q \mid \theta)$ is the
unique $q$ such that $q = \xi_Q(\theta)$ ($\theta$ is specified),
$P(q \mid \theta)$ must therefore be the Dirac delta function. The
distribution of $q$ given the data $x$ may be written using (1) and (3) as

$$P(q \mid x) = \int P(q \mid \theta)\, P(\theta \mid x)\, d\theta. \tag{4}$$
A statistic $r = \xi_R(x)$ will be a function of $x$, a mapping from the data
space $X$ into some $R$, $\xi_R(\cdot) : X \to R$. Again using the delta
notation, the distribution of the statistic given data is

\begin{align}
P(r \mid x) &= \delta(r - \xi_R(x)) \tag{5} \\
  &= \prod_i \delta(r_i - \xi_{R,i}(x)). \tag{6}
\end{align}

The joint distribution of the statistic $r$ and the quantity of interest $q$,
conditioned on the data $x$, is

$$P(r, q \mid x) = P(r \mid x)\, P(q \mid x) \tag{7}$$

(since $r = \xi_R(x)$ is specified once $x$ is known, making
$P(r \mid x, q) = P(r \mid x)$), and the unconditional joint distribution is

$$P(r, q) = \int P(r \mid x)\, P(q \mid x)\, P(x)\, dx. \tag{8}$$

Finally, we define the mutual information between a statistic
$\xi_R(\cdot)$ and a quantity of interest $\xi_Q(\cdot)$ as

$$M(\xi_R(\cdot), \xi_Q(\cdot)) = \iint P(r, q) \log\left(\frac{P(r, q)}{P(r)\, P(q)}\right) dq\, dr. \tag{9}$$

This mutual information is the Kullback-Leibler distance between the joint 
distribution P(r, q) and the marginal product P(r)P(q) corresponding to 
independence between r and q. Note that this Kullback-Leibler distance is 
different from the Kullback-Leibler distance mentioned in the introduction 
(and seen later in section 5). A major contribution of this paper is the
demonstration of how these two Kullback-Leibler distances are related. 
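For intuition, the definition (9) can be evaluated directly when $r$ and $q$ take finitely many values, so that the integrals become sums. The following is a minimal sketch under that discreteness assumption; the function name `mutual_information` and the two example tables are illustrative and do not come from the paper.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint table P(r, q).

    Rows index values of the statistic r, columns index values of q.
    """
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()               # normalize defensively
    p_r = joint.sum(axis=1, keepdims=True)    # marginal P(r), shape (n_r, 1)
    p_q = joint.sum(axis=0, keepdims=True)    # marginal P(q), shape (1, n_q)
    mask = joint > 0                          # 0 * log 0 is taken as 0
    ratio = joint[mask] / (p_r @ p_q)[mask]   # P(r,q) / (P(r) P(q))
    return float(np.sum(joint[mask] * np.log(ratio)))

# Independent r and q: the joint equals the product of the marginals.
independent = np.outer([0.3, 0.7], [0.4, 0.6])
# Perfectly dependent r and q: r determines q.
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The independent table gives zero mutual information, while the perfectly dependent table gives $\log 2$ nats, the entropy of a fair binary variable.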

3 MI Statistics and the Variational Principle 

We now define the maximally informative (MI) statistic. 

Let $S = \{\xi_R(\cdot)\}$ be a set of statistics under consideration. A MI
statistic for a quantity of interest $\xi_Q(\cdot)$ is any statistic from
$S$ maximizing the mutual information $M(\xi_R(\cdot), \xi_Q(\cdot))$ between
the statistic and the quantity of interest.

The following variational principle can be used to obtain an MI statistic.
Let $\delta / \delta f(\cdot)$ denote the functional derivative with respect
to $f(\cdot)$.

Choose $\xi_R(\cdot)$ from $S$ such that
$\frac{\delta M(\xi_R(\cdot), \xi_Q(\cdot))}{\delta \xi_R(\cdot)} = 0$ and
$\frac{\delta^2 M(\xi_R(\cdot), \xi_Q(\cdot))}{\delta \xi_R(\cdot)^2}$
is negative semidefinite, i.e. so that $\xi_R(\cdot)$ maximizes the
information between itself and $\xi_Q(\cdot)$, the quantity of interest. If
possible, choose the global maximum.

Note that MI statistics in S may occur on the boundary of S. This 
may be a case of interest, which occurs when constraints are imposed on the 
statistics, and may be handled with a trivial modification. Note also that 
the space S of statistics may be constrained to contain only low-dimensional 
statistics, in order to force a dimensionality reduction of the data.

We now demonstrate the variational principle for MI statistics. The argument
proceeds by varying (see, for example, [3] for the variational calculus)
the mutual information of (9) with respect to the statistic function
$\xi_R(\cdot)$ of dimension $k_r$, i.e.
$\xi_R(\cdot) = (\xi_{R,1}(\cdot), \ldots, \xi_{R,k_r}(\cdot))$. We now
proceed to substitute $\xi_R(x) = \xi_R^0(x) + \epsilon\, \eta(x)$ in (9),
and take the derivative with respect to $\epsilon$.

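Assuming appropriate regularity conditions, we have

\begin{align}
\partial_\epsilon M(\xi_R(\cdot), \xi_Q(\cdot))
  &= \iint \left[ \partial_\epsilon P(r, q) \log\left(\frac{P(r, q)}{P(r)\, P(q)}\right)
     + P(r)\, \partial_\epsilon P(q \mid r) \right] dq\, dr \tag{10} \\
  &= \iint \partial_\epsilon P(r, q) \log\left(\frac{P(r, q)}{P(r)\, P(q)}\right) dq\, dr, \tag{11}
\end{align}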

where simplification from (10) to (11) occurs because probability is
conserved. Utilizing (5) in (8) we find

$$P(r, q) = \int \delta(r - \xi_R(x))\, P(q \mid x)\, P(x)\, dx. \tag{12}$$

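Taking the derivative of (12) with respect to $\epsilon$ yields

$$\partial_\epsilon P(r, q) = -\sum_{i=1}^{k_r} \int \delta'(r_i - \xi_{R,i}(x))\, \eta_i(x) \prod_{j \neq i} \delta(r_j - \xi_{R,j}(x))\, P(q \mid x)\, P(x)\, dx. \tag{13}$$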
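Note that because $\eta$ is arbitrary, we may choose it to simplify as needed.

We proceed by considering $k_r$ choices of $\eta$. Label the choices by
$m \in \{1, \ldots, k_r\}$, and on choice $m$ take the components of $\eta$
as follows:

\begin{align}
\eta_\ell(x) &= \delta(x - x_c), \quad (\ell = m) \tag{14} \\
\eta_\ell(x) &= 0, \quad (\ell \neq m) \tag{15}
\end{align}

where $x_c$ is any data point we may choose. The condition that the mutual
information is extremal then becomes the statement that for all $x_c$ and
$i \in \{1, \ldots, k_r\}$

\begin{align}
0 = \partial_\epsilon M(\xi_R(\cdot), \xi_Q(\cdot))\big|_{\epsilon = 0}
  &= -P(x_c) \iint \delta'(r_i - \xi^0_{R,i}(x_c)) \prod_{j \neq i} \delta(r_j - \xi^0_{R,j}(x_c)) \tag{16} \\
  &\qquad \times P(q \mid x_c) \log\left(\frac{P(r, q)}{P(r)\, P(q)}\right) dq\, dr. \tag{17}
\end{align}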

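Integrating (17) by parts with respect to $r$ (dropping both the "0"
superscript and subscript "c", since there is no distinction to be made at
this point) yields the condition that for all $x$

$$\int P(q \mid x)\, \partial_r \log\left(\frac{P(r, q)}{P(r)\, P(q)}\right)\bigg|_{r = \xi_R(x)} dq = 0 \tag{18}$$

where derivatives with respect to vectors are gradients (vectors of
derivatives). The form from which the theorems of the next section are
proven is found by rewriting (18), using
$P(r, q) / (P(r)\, P(q)) = P(q \mid r) / P(q)$, as

$$\int \frac{P(q \mid x)}{P(q \mid r)}\, \partial_r P(q \mid r)\bigg|_{r = \xi_R(x)} dq = 0. \tag{19}$$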

4 MI Statistics and Sufficiency 

Now we prove several important properties concerning MI statistics. The 
first property is the intuitively obvious property that data is a MI statistic. 
The second property is that any sufficient statistic is a MI statistic. Finally, 
we note that MI statistics are not necessarily sufficient statistics. 
Theorem 1. Any 1-1 function of data is a MI statistic of the
quantity of interest.

Proof: Let $\xi_R(\cdot)$ be the identity so that $\xi_R(x) = x$ in (19). The
fraction in that equation is then 1, and the derivative integrates
to zero because probability is conserved. Having $\xi_R(x)$ any
invertible function changes nothing, as any value of it determines
$x$.

Theorem 2 . Any sufficient statistic for the quantity of interest 
is a MI statistic of the quantity of interest 

Proof: Note that, using the definition of $\xi_R(x)$ being a sufficient
statistic, the ratio in (19) is one: the posterior distribution of the
quantity of interest given the data $x$ is the same as the posterior
distribution of the quantity of interest given the sufficient statistic
$\xi_R(x)$. The derivative then integrates to zero because probability
is conserved.

(Note that in both Theorems 1 and 2 the Hessian condition of the MI 
inference variational principle is easily established since then the extremum 
of the mutual information is easily seen to be a local maximum. Otherwise, 
one must check convexity.) 
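Theorems 1 and 2 can be checked numerically on a small discrete model. The sketch below is an illustration only: the Bernoulli model, the three-point parameter grid, the uniform prior, and all function names are assumptions introduced here, not taken from the paper. It computes $P(r, q)$ by summing over all data sets, as in (8), and compares three statistics: the data themselves, the sum (sufficient for the Bernoulli parameter), and the first observation alone (not sufficient).

```python
import itertools
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint table P(r, q)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_r = joint.sum(axis=1, keepdims=True)
    p_q = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (p_r @ p_q)[mask])))

# Hypothetical toy model: N Bernoulli(theta) draws, theta on a 3-point grid.
thetas = np.array([0.2, 0.5, 0.8])
prior = np.full(3, 1.0 / 3.0)
N = 3
data = list(itertools.product([0, 1], repeat=N))   # all 2^N data sets

def joint_r_q(statistic):
    """P(r, q) = sum_x 1[statistic(x) = r] P(x | theta) P(theta), with q = theta."""
    values = sorted({statistic(x) for x in data})
    joint = np.zeros((len(values), len(thetas)))
    for x in data:
        k = sum(x)                                  # number of successes in x
        likelihood = thetas**k * (1 - thetas)**(N - k)
        joint[values.index(statistic(x))] += likelihood * prior
    return joint

mi_data = mutual_information(joint_r_q(lambda x: x))       # identity (1-1)
mi_sum = mutual_information(joint_r_q(sum))                # sufficient
mi_first = mutual_information(joint_r_q(lambda x: x[0]))   # not sufficient
```

In this toy model the sum attains the same mutual information as the full data, while the first observation alone attains strictly less, in line with the two theorems.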

Although it is true that any sufficient statistic is a MI statistic, the con- 
verse is false. In problems (of data dimension greater than one) where a 
lower-than-data dimension sufficient statistic does not exist, there will exist 
a lower-than-data dimension statistic which is MI but not sufficient. Thus, 
the class of maximally informative statistics contains the sufficient statistics, 
but is broader. MI statistics need not provide all of the available information 
about the underlying quantity of interest. For example, as we show in
Section 7, such a lower-than-data dimension MI statistic can be obtained for the
Cauchy distribution where a lower-than-data dimension sufficient statistic is 
a-priori unavailable. In this manner, MI statistics seamlessly address rele- 
vance to the consumer of the information because it is about some relevant 
quantity of interest that MI statistics are maximally informative. 
5 MI Statistics and the KL Distance



Equation (19) may be rewritten as

$$\partial_r \int P(q \mid x) \log\left(\frac{P(q \mid x)}{P(q \mid r)}\right) dq\, \bigg|_{r = \xi_R(x)} = 0 \tag{20}$$



which, along with the curvature condition, states that 

Theorem 3. The Kullback-Leibler distance between the posterior 
distribution conditioned on the statistic and the posterior distri- 
bution conditioned on the data is minimized by a MI statistic. 
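The link between the stationarity condition and the KL distance of Theorem 3 can be spelled out in one line. The following is a sketch, holding $x$ fixed and assuming that differentiation under the integral sign is permitted; it uses only the fact that $P(q \mid x)$ does not depend on $r$.

```latex
\begin{align*}
\partial_r \int P(q \mid x)\,\log\frac{P(q \mid x)}{P(q \mid r)}\,dq
  &= -\int P(q \mid x)\,\partial_r \log P(q \mid r)\,dq \\
  &= -\int \frac{P(q \mid x)}{P(q \mid r)}\,\partial_r P(q \mid r)\,dq .
\end{align*}
```

Setting the right-hand side to zero at $r = \xi_R(x)$ gives the extremum condition. When the ratio $P(q \mid x)/P(q \mid r)$ equals one, the integral reduces to $\partial_r \int P(q \mid r)\, dq = \partial_r 1 = 0$, which is the conservation-of-probability argument used in the proofs of Theorems 1 and 2.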

Again, note that MI statistics for the quantity of interest are generally not 
sufficient statistics for the quantity of interest. Indeed, rather than making 
the Kullback-Leibler distance zero, as in the case of sufficient statistics, MI 
statistics are found at local minima of the Kullback-Leibler distance, viewed
as a functional of the statistic. This demonstrates how the approach of this
paper generalizes that performed by Lindley.

6 MI Statistics for the Gaussian distribution 

This section details the inference of the one-dimensional MI statistic for the
one-dimensional Gaussian distribution. We take the position parameter of
the Gaussian to be $q$, and the goal is to find $\xi_R(x)$ so that (19) holds.
From there note that the calculation of $P(q \mid r)$ and $P(q \mid x)$ is
necessary, and by Bayes' theorem therefore it is necessary to find
$P(r \mid q)$, which may be written as

$$P(r \mid q) = \int \delta(r - \xi_R(x))\, P(x \mid q)\, dx. \tag{21}$$

The ansatz $\xi_R(x) = \sum_{i=1}^N \lambda_i x_i$ is useful (and not
restrictive since the $\lambda_i$'s are implicitly only restricted to be
functions of $x$), and making the changes of variables
$y_i = \lambda_i x_i$ followed by $u_i = y_i - \lambda_i q$ in (21) yields a
form which may immediately be recognized as the convolution of $N$ Gaussians
with means $\mu_i = \lambda_i q$ and standard deviations
$\sigma_i = \lambda_i \sigma$ respectively,
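$$P(r \mid q) = \int \delta\Big(r - \sum_{i=1}^N \lambda_i q - \sum_{i=1}^N u_i\Big) \prod_{i=1}^N \frac{e^{-u_i^2 / 2(\sigma \lambda_i)^2}}{\sqrt{2\pi}\, \lambda_i \sigma}\, du. \tag{22}$$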



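This has the solution

$$P(r \mid q) = \phi(0, \sigma')\Big(r - \sum_{i=1}^N \lambda_i q\Big) \tag{23}$$

where $\sigma' = \sigma \sqrt{\sum_{i=1}^N \lambda_i^2}$ and
$\phi(\cdot, \cdot)(\cdot)$ is the Gaussian density

$$\phi(\mu, \sigma)(z) = \frac{1}{\sqrt{2\pi}\, \sigma}\, e^{-(z - \mu)^2 / 2\sigma^2}. \tag{24}$$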



Finally, inserting this result into Bayes' theorem with uniform prior to find
the posterior distribution of $q$ conditioned on $r$ yields

$$P(q \mid r) = S\, \phi(0, \sigma')\Big(r - \sum_{i=1}^N \lambda_i q\Big) \tag{25}$$

where $S := \sum_{i=1}^N \lambda_i$.

The calculation for $P(q \mid x)$ is similar, with the result that

$$P(q \mid x) = \phi\Big(\bar{x}, \frac{\sigma}{\sqrt{N}}\Big)(q) \tag{26}$$

where $\bar{x} := \frac{1}{N} \sum_{i=1}^N x_i$. From the forms of (25) and
(26) it is clear that not only will the integrand of (20) (that equation
being equivalent to (19)) be minimized, but that it will be zero, if all
$\lambda_i = 1/N$ are chosen. This of course is the expected result since
$\xi_R(x) = \sum_{i=1}^N x_i / N$ is a sufficient statistic for $q$
when $\sigma$ is known.
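The claim that the KL distance vanishes for equal weights can be checked with the closed-form KL distance between two Gaussians. The sketch below reads (25) as a Gaussian in $q$ with mean $r/S$ and standard deviation $\sigma'/S$, and (26) as a Gaussian with mean $\bar{x}$ and standard deviation $\sigma/\sqrt{N}$. The function name `kl_posteriors`, the sample values, and the weight vectors are illustrative assumptions, and the check is made at a single data set rather than for all $x$.

```python
import numpy as np

def kl_posteriors(x, lam, sigma=1.0):
    """KL( P(q | x) || P(q | r) ) for the Gaussian model with known sigma,
    where r = sum_i lam_i * x_i and the prior on q is uniform."""
    x = np.asarray(x, dtype=float)
    lam = np.asarray(lam, dtype=float)
    n = x.size
    S = lam.sum()
    r = lam @ x
    # P(q | x): Gaussian with mean x_bar and std sigma / sqrt(N), as in (26).
    m1, s1 = x.mean(), sigma / np.sqrt(n)
    # P(q | r): Gaussian in q with mean r / S and std sigma' / S, from (25).
    m2, s2 = r / S, sigma * np.sqrt((lam**2).sum()) / S
    # Closed-form KL distance between two univariate Gaussians.
    return float(np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5)

x = np.array([0.3, -1.1, 2.4, 0.9])                 # hypothetical sample
kl_equal = kl_posteriors(x, np.full(4, 0.25))       # lambda_i = 1/N
kl_unequal = kl_posteriors(x, np.array([0.7, 0.1, 0.1, 0.1]))
```

With equal weights the two posteriors coincide and the distance is zero up to rounding. With unequal weights the posterior given $r$ is wider and displaced, so the distance is strictly positive.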

Alternatively, to verify that the calculation indicated in (19) is successful
at finding the expected result, continue by taking (25) and (26) and
substituting them into (19) to find, after some simplification, the equation
which must be satisfied by $\xi_R$:

$$0 = \int \Big(r - q \sum_{i=1}^N \lambda_i\Big)\, e^{-(\bar{x} - q)^2 / 2(\sigma / \sqrt{N})^2}\, dq\, \bigg|_{r = \xi_R(x)}. \tag{27}$$
This has the unique solution $\xi_R(x) = \bar{x}$ when the arbitrary scale of
the inferred statistic is fixed by setting $1 = \sum_{i=1}^N \lambda_i$. To
conclude this section, the procedure culminating in (20) or (19) of finding
MI statistics has been shown to produce the expected known result for the
Gaussian case. The next section approaches the Cauchy distribution case for
lower than data dimension statistics, where there is no sufficient statistic
available and the result is novel.



7 MI Statistics for the Cauchy Distribution 



This section outlines the inference of the one-dimensional MI statistic for
the one-dimensional Cauchy distribution. The detailed steps may be taken
similarly to those of the last section but taking the Cauchy distribution
instead of the Gaussian distribution. Take the position parameter of the
Cauchy to be $q$, and the goal is to find $\xi_R(x)$ so that (19) holds. As
in the last section it is necessary to determine both $P(q \mid r)$ and
$P(q \mid x)$. Assuming the same ansatz that
$\xi_R(x) = \sum_{i=1}^N \lambda_i x_i$, the necessary convolutions
may be carried out with the use of the Fourier convolution theorem, with the
results that

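\begin{align}
P(r \mid q) &= \frac{S}{\pi (S^2 + (r - qS)^2)} \tag{28} \\
P(q \mid r) &= \frac{S^2}{\pi (S^2 + (r - qS)^2)} \tag{29}
\end{align}

and

$$P(q \mid x) \propto \prod_{i=1}^N \frac{1}{\pi (1 + (x_i - q)^2)} \tag{30}$$

where $S := \sum_{i=1}^N \lambda_i$. Substituting (28), (29), and (30) into
(19) yields the equation that must be solved for $\xi_R(x)$:

$$0 = \int \Big(\prod_{i=1}^N \frac{1}{\pi (1 + (x_i - q)^2)}\Big)\, \frac{r/S - q}{1 + (r/S - q)^2}\, dq\, \bigg|_{r = \xi_R(x)}. \tag{31}$$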



Rewriting this equation in more suggestive terms, while taking the scale
$S = 1$, gives the result as an implicit equation for $\xi_R(x)$,

$$\xi_R(x) = \int q\, P(q \mid (x, \xi_R(x)))\, dq. \tag{32}$$
The form of the result (32) says that $\xi_R(x)$ is the posterior mean of $q$
given the data and itself as an additional observation. This form also
suggests that $\xi_R(x)$ could be the posterior mean of $q$ given the data.
However, this is not the case, as a check using previously derived posterior
moment forms immediately shows. Further, assuming a value for $\xi_R(x)$ on
the right-hand side of (32) allows that side to be computed in closed form
using those same results. This finally yields that the left-hand side is a
rational function of the right-hand side, a fixed point equation which may be
solved by standard iterative methods. Other checks immediately show that the
solution is not the maximum likelihood solution, nor the median.
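The implicit equation (32) can also be solved numerically without the closed-form moments. The sketch below is one possible approach, not the authors' procedure: it evaluates the posterior mean by a Riemann sum on a grid (unit-scale Cauchy likelihood, uniform prior, $S = 1$) and finds the root of the residual $E[q \mid x, r] - r$ by bisection, chosen here for robustness in place of a plain fixed-point iteration. The bracket $[\min_i x_i, \max_i x_i]$ is valid because the mean of a product of symmetric unimodal factors lies within the range of their centers. The sample, the grid, and the function names are illustrative assumptions.

```python
import numpy as np

def posterior_mean_with_extra(x, r, grid):
    """Mean of q under the density proportional to
    prod_i 1/(1 + (x_i - q)^2) * 1/(1 + (r - q)^2), via a Riemann sum."""
    log_w = -np.log1p((x[:, None] - grid[None, :])**2).sum(axis=0)
    log_w -= np.log1p((r - grid)**2)       # r enters as an extra observation
    w = np.exp(log_w - log_w.max())        # stabilized unnormalized weights
    return float((grid * w).sum() / w.sum())

def cauchy_mi_statistic(x, grid, iters=60):
    """Solve r = E[q | x, r] (equation (32) with S = 1) by bisection."""
    lo, hi = float(x.min()), float(x.max())
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if posterior_mean_with_extra(x, mid, grid) - mid > 0:
            lo = mid                        # root lies above mid
        else:
            hi = mid                        # root lies at or below mid
    return 0.5 * (lo + hi)

x = np.array([-1.2, 0.3, 0.8, 1.5, 4.0])    # hypothetical sample
grid = np.linspace(-60.0, 60.0, 120001)     # integration grid, step 0.001

r_mi = cauchy_mi_statistic(x, grid)
residual = posterior_mean_with_extra(x, r_mi, grid) - r_mi

# Posterior mean given the data only, for comparison with r_mi.
log_w = -np.log1p((x[:, None] - grid[None, :])**2).sum(axis=0)
w = np.exp(log_w - log_w.max())
bayes_mean = float((grid * w).sum() / w.sum())
```

Comparing `r_mi` with `bayes_mean` for an asymmetric sample gives a direct numerical view of the distinction drawn in the text between the MI statistic and the posterior mean given the data only.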

To conclude this section, the one-dimensional MI statistic for the Cauchy 
distribution position parameter has been found as the posterior mean of the 
position parameter of the Cauchy distribution given the data and the MI 
statistic, and this statistic is different from the Bayes' estimator which is the 
posterior mean given the data only. 



8 Approximate MI Inference and Bayes Estimators 



In many cases of interest, if not in all cases of relevance with high
dimensional data, convolutions similar to those in (21) etc. will be quite
impossible to do in closed form, and in a practical sense will probably even
be numerically intractable. However, there is an approach that may be taken
which does some harm to a fully rigorous Bayesian approach, but which may
be necessary. The idea that is applicable in these cases of difficulty is to
directly take $P(q \mid r)$ in (20) to be Gaussian with parameters
$r = \xi_R(x) = (\mu(x), \sigma(x))$. The approximate MI approach just
outlined is applied below to finding the approximate MI statistics
$(\mu(x), \sigma(x))$. The approximate MI approach is then contrasted with an
alternative approach using the KL distance inverted from that of (20), one
that resembles Maximum Entropy inference. The results of this section hold
for any likelihood, as will become apparent.

Take an arbitrary one-dimensional likelihood parameterized by $q$ (i.e. with
$q$ the parameter of interest). Parameterize the inferred distribution
$P(q \mid r)$ of (20) as (see (24))

$$P(q \mid r = (\mu, \sigma)) = \phi(\mu, \sigma)(q). \tag{33}$$
Equations (20) and (33) imply that the MI statistic is

$$\mu = \int q\, P(q \mid x)\, dq, \qquad \sigma^2 = \int (q - \mu)^2\, P(q \mid x)\, dq. \tag{34}$$



These quantities are the Bayes' estimators for the mean and standard devi- 
ation of the distribution. 
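Because the approximate MI statistics are just the first two posterior moments, they can be computed for any one-dimensional likelihood by numerical integration. The sketch below assumes a uniform prior on $q$ and a uniform grid; the function names, the sample, and the grid are illustrative. The Gaussian case serves as a sanity check, since the posterior mean and standard deviation are then $\bar{x}$ and $\sigma/\sqrt{N}$ as in (26).

```python
import numpy as np

def approximate_mi_statistics(log_likelihood, grid):
    """Posterior mean and standard deviation of q on a uniform grid,
    assuming a uniform prior, so the posterior is proportional to the likelihood."""
    log_post = log_likelihood(grid)
    w = np.exp(log_post - log_post.max())   # stabilized unnormalized posterior
    w /= w.sum()                            # discrete posterior weights
    mu = float((grid * w).sum())
    sigma = float(np.sqrt((((grid - mu)**2) * w).sum()))
    return mu, sigma

x = np.array([-1.2, 0.3, 0.8, 1.5, 4.0])    # hypothetical sample
grid = np.linspace(-60.0, 60.0, 120001)     # integration grid, step 0.001

def gaussian_loglik(q, s=1.0):
    # Log-likelihood of x under N(q, s^2), up to an additive constant.
    return -0.5 * (((x[:, None] - q[None, :]) / s)**2).sum(axis=0)

def cauchy_loglik(q):
    # Log-likelihood of x under a unit-scale Cauchy centered at q.
    return -np.log1p((x[:, None] - q[None, :])**2).sum(axis=0)

mu_g, sigma_g = approximate_mi_statistics(gaussian_loglik, grid)
mu_c, sigma_c = approximate_mi_statistics(cauchy_loglik, grid)
```

The same function serves both likelihoods, which illustrates the point that the approximate MI approach needs no likelihood-specific derivation beyond evaluating the posterior.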

If, on the other hand, the inverted form of the KL distance is taken, as
it often is in many of the cases we have observed, the statistic $\mu$ is

$$\mu = \frac{\int q\, \phi(\mu, \sigma)(q) \log(P(q \mid x))\, dq}{\int \phi(\mu, \sigma)(q) \log(P(q \mid x))\, dq} \tag{35}$$

which, along with another non-linear equation for $\sigma$, is a complicated
non-linear system to be solved for $r = (\mu, \sigma)$.

Note that when the likelihood $P(x \mid q)$ is Gaussian these two approximate
approaches produce the same statistic, the posterior mean and standard
deviation; but for the Cauchy likelihood, for example, this is not the case,
and the complicated nonlinear system must be solved. In contrast, the
approximate MI inference technique always produces the posterior Bayes'
moment estimators.

The difference between the forms of the approximate MI statistics and
the inverted-KL statistics appearing in (34) and (35) respectively makes it
clear that one needs a good first-principles approach to the KL distance.

9 Conclusion

We have formulated the mutual information based variational principle for
statistical inference, a fully Bayesian approach to inference, defined MI
statistics for a quantity of interest, shown how the principle may be
reformulated as a minimal KL distance principle based on posterior
distributions, and demonstrated how inference proceeds, when
lower-than-data dimension sufficient statistics are absent, using the Cauchy
distribution. Finally, an approximate approach to the inference of MI
statistics was discussed, and the relationship of the resulting statistics to
Bayes' estimators and the Maximum Entropy version of the same approximation
was noted.